read_csv('A,B
12:00, 12:00
14:30, midnight
20:01, noon')
# A tibble: 3 × 2
  A      B
  <time> <chr>
1 12:00  12:00
2 14:30  midnight
3 20:01  noon
Lecture 6:
Non-rectangular data
2024-10-24
readr, read_csv(), read_delim() to parse CSVs
tibble
guess_parser()
tidyverse
read.csv(): “invalid multibyte string at …”
read_delim(): more robust to encoding issues!
With iconv(), we found the right encoding for \xF6 and could then import the data.
However, because of the special character, the column was read as character. We replaced the value and coerced the whole column to numeric.
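A minimal base-R sketch of these two steps. The vector `raw_column` and its values are made up for illustration, Latin-1 is only assumed as the source encoding of \xF6, and replacing the offending value with NA is one possible choice, not necessarily the one used in the exercise.

```r
# Assumption: the byte \xF6 is Latin-1 encoded, where it stands for "ö".
special <- iconv("\xF6", from = "latin1", to = "UTF-8")
special == "\u00f6"           # TRUE

# Hypothetical column: one cell holds the special character instead of a
# number, so the whole column is stored as character.
raw_column <- c("10", special, "30")
typeof(raw_column)            # "character"

# Replace the offending value (here with NA), then coerce to numeric.
raw_column[raw_column == special] <- NA
clean_column <- as.numeric(raw_column)
clean_column                  # 10 NA 30
```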
Using:
“If the two arguments are atomic vectors of different types, one is coerced to the type of the other, the (decreasing) order of precedence being character, complex, numeric, integer, logical and raw.”
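The precedence order in the quote can be checked directly in the console. The sketch below uses made-up values: `c()` shows which type wins when vectors are combined, and `==` shows the coercion that happens during a comparison.

```r
# Combining values of different types coerces to the "higher" type.
lgl_int <- c(TRUE, 2L)     # logical + integer
int_dbl <- c(2L, 2.5)      # integer + double
dbl_chr <- c(2.5, "a")     # double  + character

typeof(lgl_int)            # "integer"
typeof(int_dbl)            # "double"
typeof(dbl_chr)            # "character"

# Comparisons coerce one side to the type of the other.
TRUE == 1                  # TRUE:  TRUE becomes 1
TRUE == "TRUE"             # TRUE:  TRUE becomes "TRUE"
TRUE == "1"                # FALSE: "TRUE" is not "1"
1 == "1"                   # TRUE:  1 becomes "1"

# Logicals count as 0/1 in arithmetic.
n_true <- sum(c(TRUE, FALSE, TRUE))
n_true                     # 2
```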
This question checks your understanding of the coercion of boolean values.
What is the output of the following code?
c("TRUE", "FALSE", "TRUE", "FALSE")
c(TRUE, FALSE, TRUE, FALSE)
c(1,0,1,0)
c(0,1,0,1)
Consider the data frame
Which of these statements are TRUE?
dataCHframe$PartyCenter <- c(25, 20, 15) creates a new variable called “PartyCenter”
dim(dataCHframe[, dataCHframe$PartyLeft > 40]) returns the same as dim(dataCHframe[, c(2,3)])
dim(dataCHframe[dataCHframe$PartyLeft > 40 | dataCHframe$PartyLeft < 40, ]) returns c(3,3)
dataCHframe is a data.frame, which is a list consisting of one named character vector and two named integer vectors.
You want to import a file using read_delim(). Describe what read_delim() does under the hood. What should be added to this command in order for it to work?
Consider the following code
Are these statements TRUE or FALSE?
mean(df$a) == 2.5
typeof(as.matrix(df)[,1]) is numeric (or double)
Today
Understand non-rectangular data: xml, json, and html
Be familiar with the way we parse these data into R
Guest lecture:
father  mother  name      age  gender
                John      33   male
                Julia     32   female
John    Julia   Jack      6    male
John    Julia   Jill      4    female
John    Julia   John jnr  2    male
                David     45   male
                Debbie    42   female
David   Debbie  Donald    16   male
David   Debbie  Dianne    12   female
<row>
<unique_id>216498</unique_id>
<indicator_id>386</indicator_id>
<name>Ozone (O3)</name>
<measure>Mean</measure>
<measure_info>ppb</measure_info>
<geo_type_name>CD</geo_type_name>
<geo_join_id>313</geo_join_id>
<geo_place_name>Coney Island (CD13)</geo_place_name>
<time_period>Summer 2013</time_period>
<start_date>2013-06-01T00:00:00</start_date>
<data_value>34.64</data_value>
</row>
<row>
<unique_id>216499</unique_id>
<indicator_id>386</indicator_id>
...
</row>
The “row-content” is nested between the ‘row’-tags.
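A minimal sketch of how such row-based XML could be turned into a data frame with the xml2 package. The `<response>` root element is an assumption added to make the snippet well-formed, only a few of the fields are kept, and the second row deliberately has no `data_value` to show how a missing tag becomes NA.

```r
library(xml2)

# Inline stand-in for the file; <response> is a hypothetical root element.
air_doc <- read_xml('<response>
  <row>
    <unique_id>216498</unique_id>
    <indicator_id>386</indicator_id>
    <data_value>34.64</data_value>
  </row>
  <row>
    <unique_id>216499</unique_id>
    <indicator_id>386</indicator_id>
  </row>
</response>')

# One node per <row>; each row becomes one observation.
rows <- xml_find_all(air_doc, "//row")

# xml_find_first() returns one result per row, so missing tags yield NA.
air <- data.frame(
  unique_id    = xml_text(xml_find_first(rows, "./unique_id")),
  indicator_id = xml_text(xml_find_first(rows, "./indicator_id")),
  data_value   = as.numeric(xml_text(xml_find_first(rows, "./data_value"))),
  stringsAsFactors = FALSE
)
air
```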
There are two principal ways to link variable names to values.
<variable>Monthly Surface Clear-sky Temperature (ISCCP) (Celsius)</variable>
<filename>ISCCPMonthly_avg.nc</filename>
<filepath>/usr/local/fer_data/data/</filepath>
<badflag>-1.E+34</badflag>
<subset>48 points (TIME)</subset>
<longitude>123.8W(-123.8)</longitude>
<latitude>48.8S</latitude>
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />
With tags: <filename>ISCCPMonthly_avg.nc</filename>.
With attributes: <case date="16-JAN-1994" temperature="9.200012" />.
Attributes-based:
Tag-based:
<cases>
<case>
<date>16-JAN-1994</date>
<temperature>9.200012</temperature>
</case>
<case>
<date>16-FEB-1994</date>
<temperature>10.70001</temperature>
</case>
<case>
<date>16-MAR-1994</date>
<temperature>7.5</temperature>
</case>
<case>
<date>16-APR-1994</date>
<temperature>8.100006</temperature>
</case>
</cases>
Potential drawback of XML: inefficient storage
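Both encodings carry the same information but are read with different xml2 functions: `xml_attr()` for attributes, `xml_text()` for tag content. The sketch below uses the first two cases from the example, wraps the attribute version in a `<cases>` root (an assumption, to make it well-formed), and compares the string lengths as a rough illustration of the storage overhead.

```r
library(xml2)

attr_xml <- '<cases>
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
</cases>'

tag_xml <- '<cases>
<case><date>16-JAN-1994</date><temperature>9.200012</temperature></case>
<case><date>16-FEB-1994</date><temperature>10.70001</temperature></case>
</cases>'

# Attribute-based: values sit in the attributes of each <case>.
attr_cases <- xml_find_all(read_xml(attr_xml), "//case")
from_attr <- data.frame(
  date        = xml_attr(attr_cases, "date"),
  temperature = as.numeric(xml_attr(attr_cases, "temperature")),
  stringsAsFactors = FALSE
)

# Tag-based: values sit in child nodes of each <case>.
tag_cases <- xml_find_all(read_xml(tag_xml), "//case")
from_tags <- data.frame(
  date        = xml_text(xml_find_first(tag_cases, "./date")),
  temperature = as.numeric(xml_text(xml_find_first(tag_cases, "./temperature"))),
  stringsAsFactors = FALSE
)

all.equal(from_attr, from_tags)    # TRUE: same data either way
nchar(tag_xml) > nchar(attr_xml)   # TRUE: the tag-based version is longer
```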
<?xml version="1.0" encoding="UTF-8"?>
<company_dundermifflin>
<person id="1">
<name>Michael Scott</name>
<position>Regional Manager</position>
<location branch="Scranton"/>
</person>
<person id="2">
<name>Dwight Schrutte</name>
<position>Assistant (to the) Regional Manager</position>
<location branch="Scranton"/>
<orders>
<sales>
<units>10</units>
<product>paper A4</product>
</sales>
</orders>
</person>
<person id="3">
<name>Jim Halpert</name>
<position>Sales Representative</position>
<location branch="Scranton"/>
<orders>
<sales>
<units>20</units>
<product>paper A4</product>
</sales>
<sales>
<units>5</units>
<product>paper A3</product>
</sales>
</orders>
</person>
</company_dundermifflin>
{xml_document}
<company_dundermifflin>
[1] <person id="1">\n <name>Michael Scott</name>\n <position>Regional Manager</position>\n <lo ...
[2] <person id="2">\n <name>Dwight Schrutte</name>\n <position>Assistant (to the) Regional Mana ...
[3] <person id="3">\n <name>Jim Halpert</name>\n <position>Sales Representative</position>\n < ...
‘company_dundermifflin’ is the root node; the ‘person’ nodes are its children:
{xml_nodeset (3)}
[1] <person id="1">\n <name>Michael Scott</name>\n <position>Regional Manager</position>\n <lo ...
[2] <person id="2">\n <name>Dwight Schrutte</name>\n <position>Assistant (to the) Regional Mana ...
[3] <person id="3">\n <name>Jim Halpert</name>\n <position>Sales Representative</position>\n < ...
{xml_nodeset (11)}
[1] <name>Michael Scott</name>
[2] <position>Regional Manager</position>
[3] <location branch="Scranton"/>
[4] <name>Dwight Schrutte</name>
[5] <position>Assistant (to the) Regional Manager</position>
[6] <location branch="Scranton"/>
[7] <orders>\n <sales>\n <units>10</units>\n <product>paper A4</product>\n </sales>\n</o ...
[8] <name>Jim Halpert</name>
[9] <position>Sales Representative</position>
[10] <location branch="Scranton"/>
[11] <orders>\n <sales>\n <units>20</units>\n <product>paper A4</product>\n </sales>\n < ...
{xml_nodeset (1)}
[1] <person id="1">\n <name>Michael Scott</name>\n <position>Regional Manager</position>\n <lo ...
{xml_nodeset (2)}
[1] <person id="2">\n <name>Dwight Schrutte</name>\n <position>Assistant (to the) Regional Mana ...
[2] <person id="3">\n <name>Jim Halpert</name>\n <position>Sales Representative</position>\n < ...
{xml_nodeset (1)}
[1] <company_dundermifflin>\n <person id="1">\n <name>Michael Scott</name>\n <position>Reg ...
{xml_nodeset (4)}
[1] <person id="1">\n <name>Michael Scott</name>\n <position>Regional Manager</position>\n <lo ...
[2] <company_dundermifflin>\n <person id="1">\n <name>Michael Scott</name>\n <position>Reg ...
[3] <person id="2">\n <name>Dwight Schrutte</name>\n <position>Assistant (to the) Regional Mana ...
[4] <person id="3">\n <name>Jim Halpert</name>\n <position>Sales Representative</position>\n < ...
{xml_nodeset (1)}
[1] <company_dundermifflin>\n <person id="1">\n <name>Michael Scott</name>\n <position>Reg ...
{xml_nodeset (3)}
[1] <person id="1">\n <name>Michael Scott</name>\n <position>Regional Manager</position>\n <lo ...
[2] <person id="2">\n <name>Dwight Schrutte</name>\n <position>Assistant (to the) Regional Mana ...
[3] <person id="3">\n <name>Jim Halpert</name>\n <position>Sales Representative</position>\n < ...
{xml_nodeset (3)}
[1] <sales>\n <units>10</units>\n <product>paper A4</product>\n</sales>
[2] <sales>\n <units>20</units>\n <product>paper A4</product>\n</sales>
[3] <sales>\n <units>5</units>\n <product>paper A3</product>\n</sales>
[1] "10paper A4" "20paper A4" "5paper A3"
{xml_nodeset (3)}
[1] <name>Michael Scott</name>
[2] <name>Dwight Schrutte</name>
[3] <name>Jim Halpert</name>
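Node sets like the ones printed above can be obtained with xml2 navigation functions and XPath expressions. The following is a sketch on a shortened copy of the document (positions and locations are left out), not necessarily the calls used to produce the listings.

```r
library(xml2)

office <- read_xml('<company_dundermifflin>
  <person id="1"><name>Michael Scott</name></person>
  <person id="2"><name>Dwight Schrutte</name>
    <orders>
      <sales><units>10</units><product>paper A4</product></sales>
    </orders>
  </person>
  <person id="3"><name>Jim Halpert</name>
    <orders>
      <sales><units>20</units><product>paper A4</product></sales>
      <sales><units>5</units><product>paper A3</product></sales>
    </orders>
  </person>
</company_dundermifflin>')

xml_name(office)                  # name of the root node
persons <- xml_children(office)   # the three <person> nodes
person_ids <- xml_attr(persons, "id")

# XPath: "//" selects matching nodes anywhere in the document.
sales <- xml_find_all(office, "//sales")
sales_text <- xml_text(sales)     # "10paper A4" "20paper A4" "5paper A3"

person_names <- xml_text(xml_find_all(office, "//name"))

# Narrower paths give cleaner values than the text of a whole <sales> node.
units <- as.numeric(xml_text(xml_find_all(office, "//sales/units")))
sum(units)                        # 35
```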
XML:
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<age>25</age>
<address>
<streetAddress>21 2nd Street</streetAddress>
<city>New York</city>
<state>NY</state>
<postalCode>10021</postalCode>
</address>
<phoneNumber>
<type>home</type>
<number>212 555-1234</number>
</phoneNumber>
<phoneNumber>
<type>fax</type>
<number>646 555-4567</number>
</phoneNumber>
<gender>
<type>male</type>
</gender>
</person>
JSON:
{"firstName": "John",
"lastName": "Smith",
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021"
},
"phoneNumber": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "fax",
"number": "646 555-4567"
}
],
"gender": {
"type": "male"
}
}
List of 6
$ firstName : chr "John"
$ lastName : chr "Smith"
$ age : int 25
$ address :List of 4
..$ streetAddress: chr "21 2nd Street"
..$ city : chr "New York"
..$ state : chr "NY"
..$ postalCode : chr "10021"
$ phoneNumber:'data.frame': 2 obs. of 2 variables:
..$ type : chr [1:2] "home" "fax"
..$ number: chr [1:2] "212 555-1234" "646 555-4567"
$ gender :List of 1
..$ type: chr "male"
The nesting structure is represented as a nested list:
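A structure like the one printed above can be obtained with `fromJSON()` from the jsonlite package. The sketch parses the same JSON from an inline string (the lecture may have read it from a file) and then walks the nested list.

```r
library(jsonlite)

person_json <- '{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "fax",  "number": "646 555-4567"}
  ],
  "gender": {"type": "male"}
}'

person <- fromJSON(person_json)
str(person)                       # nested list, as in the output above

# Objects become lists; an array of objects is simplified to a data frame.
city <- person$address$city
fax  <- person$phoneNumber$number[person$phoneNumber$type == "fax"]
city                              # "New York"
fax                               # "646 555-4567"
```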
HyperText Markup Language (HTML), designed to be read by a web browser.
HTML documents/webpages consist of ‘semi-structured data’:
In this example, we look at Wikipedia’s Economy of Switzerland page.
Source: https://en.wikipedia.org/wiki/Economy_of_Switzerland
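A minimal sketch of parsing an HTML table with the rvest package. To stay self-contained it parses a small made-up page from a string; the table content is placeholder data, not taken from the Wikipedia article. For the real page, the URL above would be passed to `read_html()` instead.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page.
# Real page: read_html("https://en.wikipedia.org/wiki/Economy_of_Switzerland")
page <- read_html('<html><body>
  <h1>Example page</h1>
  <table>
    <tr><th>indicator</th><th>value</th></tr>
    <tr><td>a</td><td>1</td></tr>
    <tr><td>b</td><td>2</td></tr>
  </table>
</body></html>')

# Select nodes with CSS selectors, then extract their content.
heading <- html_text(html_element(page, "h1"))
tables  <- html_table(html_elements(page, "table"))  # list of tibbles

first_table <- tables[[1]]
first_table
```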
-> Exercise session this afternoon!
Text is unstructured data. Text analysis and feature extraction is the basis for new genAI models!
-> check the code example on Canvas.